Distances between Distributions: Comparing Language Models
Internal identifier: 001545 (Main/Exploration); previous: 001544; next: 001546
Authors: Thierry Murgue [France]; Colin De La Higuera [France]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2004.
Abstract
Language models are used in a variety of fields to support other tasks: classification, next-symbol prediction, and pattern analysis. To compare language models, to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose using distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can also be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing fast convergence of the distance through sampling and good scalability, enabling us to use this distance to decide whether two distributions are equal when only samples are provided, or to classify texts.
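The sampling-based estimate mentioned in the abstract can be illustrated with a minimal sketch, assuming each distribution is only available as a finite sample of strings. The function names `empirical_distribution` and `l2_distance` are hypothetical and do not come from the paper. The sketch does not cover the exact computation on stochastic deterministic finite automata.

```python
from collections import Counter
from math import sqrt


def empirical_distribution(sample):
    """Map each distinct string to its relative frequency in the sample."""
    if not sample:
        raise ValueError("sample must contain at least one string")
    counts = Counter(sample)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def l2_distance(p, q):
    """Quadratic (L2) distance between two distributions given as dicts.

    Strings missing from one distribution are treated as having
    probability 0 in that distribution.
    """
    support = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in support))


# Two small illustrative samples (made up for this example).
sample_a = ["a", "ab", "a", "b"]
sample_b = ["a", "a", "ab", "ab"]

p = empirical_distribution(sample_a)
q = empirical_distribution(sample_b)
distance = l2_distance(p, q)
```

Only strings that occur in at least one sample contribute to the sum, so the cost grows with the number of distinct strings observed, not with the size of the underlying language.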
DOI: 10.1007/978-3-540-27868-9_28
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Distances between Distributions: Comparing Language Models</title>
<author><name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</author>
<author><name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:27187A35B4B9CB57D3CC0D83A32284926A6E9184</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-27868-9_28</idno>
<idno type="url">https://api.istex.fr/document/27187A35B4B9CB57D3CC0D83A32284926A6E9184/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001454</idno>
<idno type="wicri:Area/Istex/Curation">001368</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D80</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Murgue T:distances:between:distributions</idno>
<idno type="wicri:Area/Main/Merge">001596</idno>
<idno type="wicri:Area/Main/Curation">001545</idno>
<idno type="wicri:Area/Main/Exploration">001545</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Distances between Distributions: Comparing Language Models</title>
<author><name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>RIM, Ecole des Mines de Saint-Etienne, 158, Cours Fauriel, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName><region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>EURISE, University of Saint-Etienne, 23 rue du Dr Paul Michelon, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName><region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>EURISE, University of Saint-Etienne, 23 rue du Dr Paul Michelon, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName><region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">27187A35B4B9CB57D3CC0D83A32284926A6E9184</idno>
<idno type="DOI">10.1007/978-3-540-27868-9_28</idno>
<idno type="ChapterID">28</idno>
<idno type="ChapterID">Chap28</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing a fast convergence of the distance through sampling and a good scalability, enabling us to use this distance to decide if two distributions are equal when only samples are provided, or to classify texts.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Auvergne-Rhône-Alpes</li>
<li>Rhône-Alpes</li>
</region>
<settlement><li>Saint-Etienne</li>
</settlement>
</list>
<tree><country name="France"><region name="Auvergne-Rhône-Alpes"><name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</region>
<name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
<name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001545 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001545 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:27187A35B4B9CB57D3CC0D83A32284926A6E9184 |texte= Distances between Distributions: Comparing Language Models }}
This area was generated with Dilib version V0.6.32.